ADAPTIVE NONPARAMETRIC BAYESIAN INFERENCE USING
LOCATION-SCALE MIXTURE PRIORS

By R. de Jonge and J. H. van Zanten 1 

Eindhoven University of Technology 

We study location-scale mixture priors for nonparametric statis- 
tical problems, including multivariate regression, density estimation 
and classification. We show that a rate-adaptive procedure can be 
obtained if the prior is properly constructed. In particular, we show 
that adaptation is achieved if a kernel mixture prior on a regression 
function is constructed using a Gaussian kernel, an inverse gamma 
bandwidth, and Gaussian mixing weights. 

1. Introduction. In Bayesian nonparametrics, the use of location-scale 
mixtures of kernels for the construction of priors on probability densities is 
well established. The methodology is used in a variety of practical settings,
and in recent years there has been substantial progress on the mathematical,
asymptotic theory for kernel mixture priors as well; cf. [3, 5, 6,
15, 23, 29]. At the present time, we have a well-developed understanding of 
important aspects including consistency, convergence rates, rate-optimality 
and adaptation properties. A similar, parallel development has taken place 
in the area of beta mixture priors; cf. [4, 14, 20, 21]. 

A discrete location-scale mixture of a fixed probability density $p$ on $\mathbb{R}^d$
can be expressed as

(1.1) $x \mapsto \sum_{j=1}^{m} w_j \frac{1}{\sigma^d}\, p\Big(\frac{x - x_j}{\sigma}\Big)$,

where $m \in \mathbb{N}$, $x_1, \ldots, x_m \in \mathbb{R}^d$, $w_1, \ldots, w_m \ge 0$ and $\sum w_j = 1$, and $\sigma > 0$.
A prior on densities is obtained by putting prior distributions on $m$, the
locations $x_j$, the scale $\sigma$ and the weights $w_j$. When $p$ satisfies some regularity
conditions, a wide class of probability densities can be well approximated by
mixtures of the form (1.1). This indicates that if the priors on the coefficients 
are suitably chosen, the resulting prior and posterior on probability densities 
can be expected to have good asymptotic properties. The cited papers give 
precise conditions under which this is indeed the case. 
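
To make the form (1.1) concrete, the following minimal sketch evaluates a location-scale mixture numerically. It assumes a standard Gaussian kernel; the function names and the example locations, weights and scale are hypothetical choices made for illustration only.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density on R^d, applied along the last axis of u."""
    d = u.shape[-1]
    return np.exp(-0.5 * np.sum(u**2, axis=-1)) / (2.0 * np.pi) ** (d / 2.0)

def location_scale_mixture(x, locations, weights, sigma, kernel=gaussian_kernel):
    """Evaluate sum_j w_j * sigma^{-d} * p((x - x_j) / sigma) at the rows of x.

    x has shape (n, d), locations has shape (m, d), weights has shape (m,).
    """
    d = x.shape[1]
    # (n, m, d) array of scaled differences (x_i - x_j) / sigma
    u = (x[:, None, :] - locations[None, :, :]) / sigma
    return kernel(u) @ weights / sigma**d

# Hypothetical one-dimensional example with three components.
locs = np.array([[0.2], [0.5], [0.8]])
w = np.array([0.3, 0.4, 0.3])
grid = np.linspace(-3.0, 4.0, 7001).reshape(-1, 1)
values = location_scale_mixture(grid, locs, w, sigma=0.3)

# Riemann-sum check that nonnegative weights summing to one give a density.
mass = float(np.sum(values) * (grid[1, 0] - grid[0, 0]))
```

With nonnegative weights that sum to one the result is again a probability density; dropping that restriction gives the signed mixtures used for regression later in the paper.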

Obviously, a much wider class of functions is well approximated by mixtures
of the form (1.1) if we lift the restriction that the weights $w_j$ should
be nonnegative and sum up to 1. This suggests that location-scale mixtures
might be attractive priors not just in the setting of density estimation, but 
for instance also in nonparametric regression. Although this idea has been 
proposed in the applied literature; cf., for example, [11, 22], it does not 
seem to have attracted a great deal of attention. The few examples do show 
however that the approach can yield quite satisfactory results. 

In the paper [22], location-scale mixture priors are used in an astrophysical 
setting for the analysis of data from galactic radio sources. The statistical
problem essentially boils down to a bivariate, nonparametric, fixed design 
regression problem. The use of a mixture prior is natural in that particular 
application because it reflects the idea that the function of interest, which 
describes the strength of the magnetic field caused by our planet and its 
"neighborhood" in space, is in fact an aggregate of contributions from a 
large number of locations, with different weights, which can be positive or 
negative. 

Another reason for using a location-scale mixture prior in multivariate
regression, instead of for instance the popular Gaussian squared exponential
or Matern priors, is its computational advantages. Conditional on the grid size
$m$ the prior only involves finitely many terms, so no artificial truncation
or approximation is necessary for computation. As argued also in [22], the
mixture prior makes it possible to avoid the inversion or decomposition of nontrivial and
often ill-behaved $n \times n$ matrices (with $n$ the sample size), which can become
cumbersome already for moderate sample sizes (cf. also the discussion in [1]).
In the astrophysical application of [22], the sample size is of the order 1500 
and it is shown that samples of this order can be dealt with effectively using 
kernel mixture priors. 

On the theoretical side, little or nothing seems to be known for kernel 
mixture priors in a regression setting. In the present paper, we therefore take 
up the study of asymptotic properties, in order to assess the fundamental 
potential of the methodology and to provide a theoretical underpinning of its 
use in practice. We will show that if the kernel and the priors on locations and 
scales are appropriately chosen, kernel mixture priors yield posteriors with 
very good asymptotic properties. It is well known that for the estimation of
an $\alpha$-regular function of $d$ variables, the best possible rate of convergence is
of the order $n^{-\alpha/(d+2\alpha)}$, where $n$ is the number of observations available. We
will prove that up to a logarithmic factor, this optimal rate can be attained
with location-scale mixture priors. More importantly, the near optimal rate
can be achieved by a prior that does not depend on the unknown smoothness
level $\alpha$ of the regression function. In other words, we can obtain a fully
adaptive procedure.

The bounds for the convergence rates that we will obtain depend crucially 
on the smoothness of the kernel p that is used. For kernels with only a finite 
degree of regularity, we get suboptimal rates. We only obtain the optimal 
minimax rate (up to a logarithmic factor) for kernels that are infinitely 
smooth, in the sense that they admit an analytic extension to a strip in 
complex space. The standard normal kernel is an example of an optimal 
choice in this respect. We also have to put (mild) conditions on the priors
on the grid size $m$ and the scale $\sigma$. In particular, the popular inverse gamma
choice for the scale is included in our setup.

Perhaps surprising is the fact that although we use a probability density 
p to construct the mixtures, we can still achieve adaptation to all smooth- 
ness levels. Intuition from kernel estimation might suggest that when p is 
a centered probability density, we have good approximation behavior for 
regression functions with regularity at most 2, and that for more regular 
functions we should use higher order kernels. This turns out not to be the 
case however. To prove this fact, we adapt an observation of Rousseau, who 
uses a similar idea to prove that for densities on the unit interval, using 
appropriate mixtures of beta densities yields adaptation to all smoothness 
levels; see [21]. The recent preprint [15], which was written at the same 
time and independently of the present work, employs the same idea to prove 
adaptation for kernel mixture priors for density estimation. In the present 
paper, we extend the technique to a multivariate setting (see Lemma 3.4
ahead).

The literature on Bayesian adaptation is still relatively young. Earlier 
papers include [2, 9, 10, 12, 17, 21] and [26]. Priors that yield adaptation 
across a continuum of regularities in nonparametric regression have been 
exhibited in [12], where priors based on spline expansions are considered, 
and [26], which uses randomly rescaled Gaussian processes as priors. 

The location-scale priors we consider in this paper are conditionally Gaus- 
sian, since we will put Gaussian priors on the mixing weights. This allows us 
to use the machinery for Gaussian process priors developed in [27] and [28] 
in our proofs. Other technical ingredients include metric entropy results for 
spaces of analytic functions, as can be found, for instance, in [13], and the 
connection between metric entropy and small deviations results for Gaussian
processes (cf. [16, 18]). We will obtain a general result for a conditionally
Gaussian kernel mixture process, which can in fact be used in a variety of 
statistical settings. To illustrate this, we present rate of contraction results 
not just for nonparametric regression, which is our main motivation, but 
also for density estimation and classification settings. 

In the next section, we present the main results of the paper. In Section 2.1,
we state a general result for a conditionally Gaussian location-scale
mixture process whose law will be used to define the kernel mixture prior in 
the various statistical settings. Rate of contraction results for nonparametric 
regression, density estimation and classification are given in Section 2.2. The 
proof of the general theorem can be found in Section 3. 

1.1. Notation. 

• $\Im z$, $\Re z$: imaginary and real part of a complex number $z$.

• $\mathbb{N}_0 = \mathbb{N} \cup \{0\}$.

• For $k \in \mathbb{N}_0^d$: $k. = k_1 + \cdots + k_d$, $k! = k_1! \cdots k_d!$.

• $f * g$: convolution of $f$ and $g$.

• $a \vee b = \max\{a, b\}$, $a \wedge b = \min\{a, b\}$, $a_+ = a \vee 0$.

• $C(X)$: continuous functions on $X$.

• $C^\alpha(X)$ for $\alpha > 0$ and $X \subset \mathbb{R}^d$: functions on $X$ with bounded partial derivatives
up to the order $\beta$, which is the largest integer strictly smaller than
$\alpha$, and such that the partial derivatives of order $\beta$ are Hölder continuous
of order $\alpha - \beta$. For $f \in C^\alpha(X)$ we denote by $\|f\|_\alpha$ the associated Hölder
norm of $f$; cf. [25], Section 2.7.1. The Hölder ball of radius $R > 0$ is defined
as $C^\alpha_R(X) = \{f \in C^\alpha(X) : \|f\|_\alpha < R\}$.

2. Main results. 

2.1. General result for Gaussian location-scale mixtures. On a common
probability space, let $M$ be an $\mathbb{N}$-valued random variable, $\Sigma$ a $(0, \infty)$-valued
random variable and $(\xi_k : k \in \mathbb{N}^d)$ standard Gaussian random variables, all
independent. The stochastic process $W$ indexed by $[0, 1]^d$ is defined by

(2.1) $W(x) = M^{-d/2} \sum_{k \in \{1,\ldots,M\}^d} \xi_k \frac{1}{\Sigma^d}\, p\Big(\frac{x - k/M}{\Sigma}\Big)$

for $x \in [0, 1]^d$, where $p : \mathbb{R}^d \to \mathbb{R}$ is a function that belongs to the class $\mathcal{P}_\gamma$ of
$\gamma$-regular kernels defined as follows.

Definition 2.1. For $\gamma \in (d/2, \infty]$, an integrable function $p$ on $\mathbb{R}^d$ belongs
to $\mathcal{P}_\gamma$ if $\int_{\mathbb{R}^d} p(x)\, dx = 1$, it is uniformly Lipschitz on $\mathbb{R}^d$, it has finite
moments of every order, and it satisfies one of the following conditions,
depending on whether $\gamma < \infty$ or $\gamma = \infty$:

• For $\gamma < \infty$: $p$ belongs to $C^\gamma(\mathbb{R}^d)$.

• For $\gamma = \infty$: $p$ is the restriction to $\mathbb{R}^d$ of a function that is defined on the
set $S = \{(z_1, \ldots, z_d) \in \mathbb{C}^d : |\Im z_j| < 1 \text{ for } j = 1, \ldots, d\}$, and that is bounded
and analytic on $S$.

Examples of kernels belonging to $\mathcal{P}_\gamma$ for $\gamma < \infty$ are abundant. Using
Fourier inversion, it is not difficult to see that an integrable function $p$ belongs
to $\mathcal{P}_\infty$ if it has a characteristic function



which is infinitely often differentiable at 0, which satisfies tp(0) = 1, and 
which satisfies the exponential moment condition 



The prime example is the standard normal density on $\mathbb{R}^d$, which is easily seen
to belong to $\mathcal{P}_\infty$. Note that we do not require that $p \ge 0$ in Definition 2.1.
So, in fact, higher order kernels are allowed as well.

The index $\gamma$ of the class of kernels quantifies the regularity of the kernel
that is employed. We will see that this regularity influences the rate of
convergence that we can obtain for the corresponding location-scale mixture
prior. The restriction $\gamma > d/2$ is connected to the fact that in order to obtain
bounds for the process $W$ independent of $M$, we want the process in (2.1)
to be well defined if the sum is taken over all $k$ in $\mathbb{N}^d$.
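
As an illustration of the construction in (2.1), the sketch below draws one path of the process for $d = 1$ with a standard Gaussian kernel. Several ingredients are assumptions made for the example rather than prescriptions of the paper: the $M^{-1/2}$ normalisation (read off from the RKHS norm in Lemma 3.1), a power-law prior on the grid size truncated at a hypothetical `m_max`, and an inverse gamma scale with hypothetical shape and scale parameters.

```python
import numpy as np

def sample_path(x, rng, s=2.0, m_max=200, ig_shape=2.0, ig_scale=0.05):
    """Draw one path of a d = 1 conditionally Gaussian kernel mixture at points x."""
    # Grid size M with P(M = m) proportional to m^{-s}; truncation at m_max
    # is a computational convenience assumed for this sketch.
    support = np.arange(1, m_max + 1, dtype=float)
    probs = support ** (-s)
    probs /= probs.sum()
    m = int(rng.choice(support, p=probs))
    # Inverse gamma scale: if G ~ Gamma(shape, 1) then scale / G is inverse gamma.
    sigma = ig_scale / rng.gamma(ig_shape)
    # Independent standard Gaussian mixing weights, one per grid point k/m.
    xi = rng.standard_normal(m)
    centers = np.arange(1, m + 1) / m
    u = (x[:, None] - centers[None, :]) / sigma
    basis = np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * sigma)
    # Assumed normalisation m^{-d/2} with d = 1.
    return basis @ xi / np.sqrt(m), m, sigma

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
path, m, sigma = sample_path(x, rng)
```

Conditionally on the drawn grid size and scale, the path is a finite Gaussian linear combination of kernel translates, which is the property exploited in the proofs.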

For $\varepsilon > 0$, the metric entropy of a set $B$ in a metric space with metric $d$ is
defined as $\log N(\varepsilon, B, d)$, where $N(\varepsilon, B, d)$ is the minimum number of balls
of radius $\varepsilon$ needed to cover $B$. Fix $0 < a < b < 1$ and define $\mathcal{X} = [a, b]^d$. Let
$d_\gamma = 2d(d + \gamma)/(2\gamma - d)$ and $\delta_\gamma = d/(2\gamma - d)$.

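Theorem 2.2. Suppose that $p \in \mathcal{P}_\gamma$ for $\gamma \in (d/2, \infty]$, that $P(M = m) \ge C m^{-s}$
for some $C > 0$, $s > 1$, and that $\Sigma$ has a Lebesgue density $g$ that, for
some $D_1, D_2, D_3, D_4 > 0$ and $q, r \ge 0$, satisfies

(2.2) $D_1 \sigma^{-q} e^{-D_2 (1/\sigma)^{d_\gamma} (\log 1/\sigma)^r} \le g(\sigma) \le D_3 \sigma^{-q} e^{-D_4 (1/\sigma)^{d_\gamma} (\log 1/\sigma)^r}$

for all $\sigma$ in a neighborhood of 0.

Then if $w_0 \in C^\alpha(\mathcal{X})$ for $\alpha > 0$, there exist for every constant $C > 1$ measurable
subsets $B_n$ of $C([0, 1]^d)$ and a constant $D > 0$ such that, for $n$ large
enough,

(2.3) $\log N(\bar\varepsilon_n, B_n, \|\cdot\|_\infty) \le D n \bar\varepsilon_n^2$,

(2.4) $P(W \notin B_n) \le e^{-C n \varepsilon_n^2}$,

(2.5) $P\big(\sup_{x \in \mathcal{X}} |W(x) - w_0(x)| \le \varepsilon_n\big) \ge e^{-n \varepsilon_n^2}$.

Here if $\gamma < \infty$

$\varepsilon_n = n^{-\alpha/(d_\gamma + 2\alpha(1+\delta_\gamma))}$,

$\bar\varepsilon_n = n^{-(\alpha(1-(d\delta_\gamma)/(2\gamma)))/((d_\gamma + 2\alpha(1+\delta_\gamma))(1+d/(2\gamma)))}$,

and if $\gamma = \infty$

$\varepsilon_n = n^{-\alpha/(d+2\alpha)} \log^{(r \vee (1+d))/(2+d/\alpha)} n$,

$\bar\varepsilon_n = n^{-\alpha/(d+2\alpha)} \log^{(r \vee (1+d))/(2+d/\alpha) + ((1+d-r)/2)_+} n$.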



A few remarks about the result are in order. First of all, the process $W$
is indexed by the unit cube, but the supremum in (2.5) is over the strictly
smaller set $\mathcal{X}$. This is due to the fact that to obtain good enough approximations
of the given function $w_0$ defined on $\mathcal{X}$ by location-scale mixtures
of the kernel $p$, we also need kernels centered at points just outside the
set $\mathcal{X}$. A result like (2.5) with the supremum over the entire unit cube is
only possible under additional assumptions on the boundary behavior of the
function $w_0$.

Theorem 2.2 connects to existing results for nonparametric Bayes procedures,
which give sufficient conditions of the form (2.3)-(2.5) for having a
certain rate of posterior contraction; cf., for example, [7, 8, 24]. In the next
subsection, we will single out the most important particular cases. In all
cases, the statistical results will state that the posterior will asymptotically
concentrate on balls of radius of the order $\bar\varepsilon_n$ around the true parameter (relative
to a natural statistical metric depending on the specific setting). Note
that in the case $\gamma < \infty$, this means we only obtain a rate if $(d\delta_\gamma)/(2\gamma) < 1$,
which is true if and only if $\gamma > (1/4)(1 + \sqrt{5})d \approx (0.81)d$. In particular, the
choice $\gamma \ge d$ suffices to have consistency. As the smoothness $\gamma$ of the kernel
$p$ that is employed is increased, the rate of contraction improves. Since
$d_\gamma \to d$ and $\delta_\gamma \to 0$ as $\gamma \to \infty$, the power of $n^{-1}$ in the expression for the rate
$\bar\varepsilon_n$ tends to $\alpha/(d + 2\alpha)$ as $\gamma \to \infty$, which corresponds to the optimal minimax
rate of convergence for estimating an $\alpha$-regular function of $d$ variables.
If an analytic kernel $p \in \mathcal{P}_\infty$ is used, the minimax rate $n^{-\alpha/(d+2\alpha)}$ itself is
attained, up to a logarithmic factor.
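
The dependence of the rate on the kernel regularity can be checked numerically. The sketch below computes the power of $n^{-1}$ for finite $\gamma$, using the exponent as it appears in Theorem 2.3, and compares it with the minimax exponent $\alpha/(d+2\alpha)$. The function names are hypothetical and the code only restates the formulas above.

```python
import math

def d_gamma(d, gamma):
    """d_gamma = 2 d (d + gamma) / (2 gamma - d), defined for gamma > d/2."""
    return 2.0 * d * (d + gamma) / (2.0 * gamma - d)

def delta_gamma(d, gamma):
    """delta_gamma = d / (2 gamma - d), defined for gamma > d/2."""
    return d / (2.0 * gamma - d)

def contraction_exponent(alpha, d, gamma):
    """Power of 1/n in the contraction rate for a kernel of finite regularity gamma."""
    dg, dl = d_gamma(d, gamma), delta_gamma(d, gamma)
    numerator = alpha * (1.0 - d * dl / (2.0 * gamma))
    denominator = (dg + 2.0 * alpha * (1.0 + dl)) * (1.0 + d / (2.0 * gamma))
    return numerator / denominator

def minimax_exponent(alpha, d):
    """Power of 1/n in the minimax rate for an alpha-regular function of d variables."""
    return alpha / (d + 2.0 * alpha)

# Threshold (1 + sqrt(5)) / 4 * d above which the exponent is positive.
threshold = (1.0 + math.sqrt(5.0)) / 4.0
```

For example, `contraction_exponent(2.0, 1, gamma)` is negative for `gamma = 0.80`, positive for `gamma = 0.82`, and approaches `minimax_exponent(2.0, 1)` as `gamma` grows.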

The proof of the theorem is deferred to Section 3. In the next subsection, 
we give the precise rate of contraction result for nonparametric regression, 
density estimation and classification settings. The first case, which was the 
original motivation for this study, is worked out in some detail. The analo- 
gous results for the second and third settings are presented more briefly, to 
avoid unnecessary duplications. 

2.2. Rate of contraction results for specific statistical settings. 

2.2.1. Regression with Gaussian errors. Consider a multivariate regression
problem where we have known design points $x_1, x_2, \ldots \in \mathcal{X} = [a, b]^d$
for some $a < b$ and $d \in \mathbb{N}$, and we observe real-valued variables $Y_1, \ldots, Y_n$
satisfying the regression relation

$Y_i = \theta(x_i) + \varepsilon_i$

for $\theta : \mathcal{X} \to \mathbb{R}$ an unknown regression function and error variables $\varepsilon_i$ that
are independent and Gaussian, with mean 0 and variance $\tau^2$. We assume
that $0 < a < b < 1$, so that the design space $\mathcal{X}$ is strictly contained in the
interior of the unit cube in $\mathbb{R}^d$.

As prior on the regression function, we employ the law $\Pi_\theta$ that the
stochastic process $W$ defined by (2.1) generates on the space $C(\mathcal{X})$ of continuous
functions on $\mathcal{X}$. The total prior $\Pi$ on the pair $(\theta, \tau)$ is then defined
by $\Pi(d\theta, d\tau) = \Pi_\theta(d\theta) \times \Pi_\tau(d\tau)$, for $\Pi_\tau$ a prior on a compact interval that
is assumed to contain the true value $\tau_0$, with a Lebesgue density that is
bounded away from 0.

The posterior distribution for $(\theta, \tau)$ given the data $Y_1, \ldots, Y_n$ is denoted
by $\Pi(\cdot \mid Y_1, \ldots, Y_n)$. By Bayes' formula, it is given by the expression

$\Pi(B \mid Y_1, \ldots, Y_n) = \frac{\int_B L(\theta, \tau; Y_1, \ldots, Y_n)\, \Pi(d\theta, d\tau)}{\int L(\theta, \tau; Y_1, \ldots, Y_n)\, \Pi(d\theta, d\tau)}$,

where

$L(\theta, \tau; Y_1, \ldots, Y_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\tau} \exp\Big(-\frac{1}{2} \Big(\frac{Y_i - \theta(x_i)}{\tau}\Big)^2\Big)$

is the likelihood. For a given sequence of positive numbers $\varepsilon_n \downarrow 0$, the posterior
is said to contract around the true parameter $(\theta_0, \tau_0)$ at the rate $\varepsilon_n$ if
for $L > 0$ sufficiently large,

$\Pi\Big((\theta, \tau) : \frac{1}{n} \sum_{j=1}^{n} (\theta(x_j) - \theta_0(x_j))^2 + |\tau - \tau_0|^2 > L^2 \varepsilon_n^2 \,\Big|\, Y_1, \ldots, Y_n\Big) \to 0$

as $n \to \infty$, where the convergence is in probability under the true distribution
governed by $(\theta_0, \tau_0)$. This means in particular that asymptotically, the
marginal posterior for $\theta$ is concentrated on balls with radius of the order $\varepsilon_n$
around the true regression function $\theta_0$, where we use the natural $L^2$-norm
associated to the empirical measure of the design points to measure distance.

The next theorem follows from Theorem 2.2, in combination with the re- 
sults in [7] (slightly adapted like Theorem 2.1 of [5] in the density estimation 
case; cf. also the discussion following Theorem 3.1 of [26]). 

Theorem 2.3. Suppose that the conditions of Theorem 2.2 are fulfilled.
Then if $\theta_0 \in C^\alpha(\mathcal{X})$ for $\alpha > 0$, the posterior contracts at the rate

$n^{-\alpha(1-(d\delta_\gamma)/(2\gamma))/((d_\gamma + 2\alpha(1+\delta_\gamma))(1+d/(2\gamma)))}$

if $\gamma < \infty$, or at the rate

$n^{-\alpha/(d+2\alpha)} (\log n)^{(r \vee (1+d))/(2+d/\alpha) + ((1+d-r)/2)_+}$

if $\gamma = \infty$.

As discussed above already, the choice $p \in \mathcal{P}_\infty$ yields the best rate of
contraction, namely the optimal minimax rate, up to a logarithmic factor.
Also note that the prior does not depend on the unknown regularity $\alpha$ of
the true regression function, so the procedure is rate-adaptive. Observe that
for $p \in \mathcal{P}_\infty$ and $r = 1 + d$ we obtain the rate $(n/\log^{1+d} n)^{-\alpha/(d+2\alpha)}$. If $r$ is
strictly larger or smaller than $1 + d$, we get a slightly worse rate, in the sense
that the power of the logarithm in our upper bound for the rate increases.

In the following corollary, we single out the important special case of a 
standard Gaussian kernel and an inverse gamma prior (or a power of it in 
the multivariate case) on the scale. 

Corollary 2.4. Suppose that $p$ is the standard Gaussian density on
$\mathbb{R}^d$, $\Sigma^d$ is inverse gamma, and $M$ is such that $P(M = m) \ge C m^{-s}$ for some
$C > 0$ and $s > 1$. Then if $\theta_0 \in C^\alpha(\mathcal{X})$ for $\alpha > 0$, the posterior contracts at
the rate

$n^{-\alpha/(d+2\alpha)} \log^{(4\alpha+4\alpha d+d+d^2)/(4\alpha+2d)} n$.

Proof. Simply note that the standard normal kernel belongs to $\mathcal{P}_\infty$
and that if $\Sigma^d$ has an inverse gamma law, then (2.2) is satisfied with $r = 0$.
□
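
The power of the logarithm in Corollary 2.4 is the $\gamma = \infty$ exponent of Theorem 2.3 evaluated at $r = 0$. The short check below verifies this algebra with exact rational arithmetic and also confirms that $r = 1 + d$ gives the smallest logarithmic power among a few values of $r$. The function names are hypothetical.

```python
from fractions import Fraction

def log_power(alpha, d, r):
    """(r v (1+d)) / (2 + d/alpha) + ((1 + d - r)/2)_+ as an exact fraction."""
    alpha, r = Fraction(alpha), Fraction(r)
    first = max(r, Fraction(1 + d)) / (2 + Fraction(d) / alpha)
    second = max((1 + d - r) / 2, Fraction(0))
    return first + second

def corollary_power(alpha, d):
    """(4 alpha + 4 alpha d + d + d^2) / (4 alpha + 2 d) as an exact fraction."""
    alpha = Fraction(alpha)
    return (4 * alpha + 4 * alpha * d + d + d * d) / (4 * alpha + 2 * d)

# Compare the two expressions on a small grid of smoothness levels and dimensions.
checks = [
    log_power(alpha, d, 0) == corollary_power(alpha, d)
    for alpha in (Fraction(1, 2), 1, 3)
    for d in (1, 2, 3)
]
```

For instance, with $\alpha = 1$ and $d = 1$ both expressions equal $5/3$, whereas the choice $r = 1 + d = 2$ gives the smaller power $2/3$.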



2.2.2. Density estimation. Let $X_1, \ldots, X_n$ be a sample from a positive
density $f_0$ on the set $\mathcal{X} = [a, b]^d$, for $0 < a < b < 1$. The aim is to estimate
the unknown density.

We consider the prior $\Pi$ on densities defined as the law that is generated
on the function space $C(\mathcal{X})$ by the random function

(2.6) $x \mapsto \frac{e^{W(x)}}{\int_{\mathcal{X}} e^{W(y)}\, dy}$

for $W$ the process defined by (2.1). In this case, we say that the posterior
$\Pi(\cdot \mid X_1, \ldots, X_n)$ contracts around the true density $f_0$ at the rate $\varepsilon_n$ if for
all $L > 0$ large enough,

$\Pi(f : h(f, f_0) > L\varepsilon_n \mid X_1, \ldots, X_n) \to 0$ in $P_{f_0}$-probability

as $n \to \infty$, where $h$ is the Hellinger distance.
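
The exponentiate-and-normalise map that turns a path of $W$ into a random density, together with the Hellinger distance used to measure contraction, can be sketched on a one-dimensional grid as follows. The trapezoidal quadrature, the grid and the two example paths are assumptions made for this illustration.

```python
import numpy as np

def trapezoid(values, grid):
    """Trapezoidal approximation of the integral of values over grid."""
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid)))

def density_from_path(w_values, grid):
    """Map path values w on a 1-d grid to exp(w) / integral of exp(w)."""
    # Subtracting the maximum stabilises exp; the shift cancels in the ratio.
    unnormalised = np.exp(w_values - np.max(w_values))
    return unnormalised / trapezoid(unnormalised, grid)

def hellinger(f, g, grid):
    """Hellinger distance: square root of the integral of (sqrt f - sqrt g)^2."""
    return float(np.sqrt(trapezoid((np.sqrt(f) - np.sqrt(g)) ** 2, grid)))

# Hypothetical paths on [a, b] = [0.1, 0.9].
grid = np.linspace(0.1, 0.9, 801)
f = density_from_path(np.sin(6.0 * grid), grid)
g = density_from_path(np.cos(6.0 * grid), grid)
distance = hellinger(f, g, grid)
```

The resulting functions are strictly positive and integrate to one on the grid, so any continuous path yields a valid density.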

Theorem 2.2, the general rate of contraction results for Bayesian density
estimation (cf. [5, 8]) and the relations between the uniform norm on the
paths of $W$ and the relevant statistical metrics on the densities (2.6) (cf. [27])
yield the following result.

Theorem 2.5. In this setting, the assertions of Theorem 2.3 and Corollary 2.4
are true for $\theta_0 = \log f_0$.

2.2.3. Classification. Consider i.i.d. observations $(X_1, Y_1), \ldots, (X_n, Y_n)$,
where the $X_i$ take values in the set $\mathcal{X} = [a, b]^d$, $0 < a < b < 1$, and the $Y_i$
take values in $\{0, 1\}$. The aim is to estimate the regression function $r_0(x) =
P(Y_1 = 1 \mid X_1 = x)$.

As prior on $r_0$, we use the law $\Pi$ of the process $\Psi(W)$, where $W$ is as
in (2.1) and the link function $\Psi : \mathbb{R} \to (0, 1)$ is the logistic or normal distribution
function. Let $\Pi(\cdot \mid (X_1, Y_1), \ldots, (X_n, Y_n))$ denote the corresponding
posterior and let $G$ be the distribution of the covariate $X_1$. With $\|\cdot\|_{2,G}$ the
associated $L^2$-norm, we say that the posterior contracts around the truth $r_0$
at the rate $\varepsilon_n$ if for all large enough $L > 0$,

$\Pi(r : \|r - r_0\|_{2,G} > L\varepsilon_n \mid (X_1, Y_1), \ldots, (X_n, Y_n)) \to 0$ in $P_{r_0}$-probability

as $n \to \infty$.

Theorem 2.2, the general rate of contraction results (cf. [8]) and the relations
between the relevant norms (cf. [27]) yield the following result.

Theorem 2.6. In this setting, the assertions of Theorem 2.3 and Corollary 2.4
are true for $\theta_0 = \Psi^{-1}(r_0)$.

3. Proof of Theorem 2.2. We will find the appropriate sieves $B_n$ and
derive the inequalities (2.3)-(2.5) by using the fact that conditionally on the
grid size $M$ and the scale $\Sigma$, the process $W$ is Gaussian. For fixed $m \in \mathbb{N}$
and $\sigma > 0$, we define the stochastic process $(W^{m,\sigma}(x) : x \in [0, 1]^d)$ by setting

$W^{m,\sigma}(x) = m^{-d/2} \sum_{k \in \{1,\ldots,m\}^d} \xi_k \frac{1}{\sigma^d}\, p\Big(\frac{x - k/m}{\sigma}\Big)$.

In the following subsection, we first study some properties of the Gaussian
process $W^{m,\sigma}$ that we will need to establish (2.3)-(2.5).

3.1. Properties of $W^{m,\sigma}$. Recall that in general, the reproducing kernel
Hilbert space (RKHS) $\mathbb{H}$ attached to a zero-mean Gaussian process $X$ is defined
as the completion of the linear space of functions $t \mapsto \mathbb{E} X(t) H$ relative
to the inner product

$\langle \mathbb{E} X(\cdot) H_1, \mathbb{E} X(\cdot) H_2 \rangle_{\mathbb{H}} = \mathbb{E} H_1 H_2$,

where $H$, $H_1$ and $H_2$ are finite linear combinations of the form $\sum_i a_i X(s_i)$
with $a_i \in \mathbb{R}$ and $s_i$ in the index set of $X$. The following lemma describes the
RKHS of the process $W^{m,\sigma}$. It is a direct consequence of a general result
describing the RKHS of a Gaussian process admitting a series expansion; cf.
Theorem 4.2 of [28] and the discussion following it.

Lemma 3.1. The reproducing kernel Hilbert space $\mathbb{H}^{m,\sigma}$ of $W^{m,\sigma}$ consists
of all functions of the form

(3.1) $h(x) = \sum_{k \in \{1,\ldots,m\}^d} w_k \frac{1}{\sigma^d}\, p\Big(\frac{x - k/m}{\sigma}\Big)$,

where the weights $w_k$ range over the entire set of real numbers. The RKHS-norm
is given by

(3.2) $\|h\|^2_{\mathbb{H}^{m,\sigma}} = m^d \min \sum_{k \in \{1,\ldots,m\}^d} w_k^2$,

where the minimum is over all weights for which the representation (3.1)
holds true.

We remark that if the functions $x \mapsto p((x - k/m)/\sigma)$ on $[0, 1]^d$ are linearly
independent, then the representation (3.1) of an element of the RKHS is
necessarily unique and hence the minimum in (3.2) can be removed. For our
purpose, it is, however, not important that these functions are independent
for every fixed $\sigma$ and $m$.

Next, we consider the so-called centered small ball probabilities of the
process $W^{m,\sigma}$, which are determined by its reproducing kernel Hilbert space.
We use well-known results by Kuelbs and Li [16] and Li and Linde [18] that
relate the metric entropy of the unit ball in the RKHS to the centered small
ball probabilities of the process. The unit ball $\mathbb{H}_1^{m,\sigma}$ in the reproducing kernel
Hilbert space $\mathbb{H}^{m,\sigma}$ is the set of all elements $h \in \mathbb{H}^{m,\sigma}$ such that $\|h\|_{\mathbb{H}^{m,\sigma}} \le 1$.

To find an upper bound for the metric entropy of the unit ball, we embed
it in an appropriate space of functions for which an upper bound for the entropy
is known, depending on the value of $\gamma$. First, we consider the case $\gamma < \infty$. Let
$h$ be an element of $\mathbb{H}_1^{m,\sigma}$. By Lemma 3.1, it admits a representation (3.1),
with the weights such that $\|h\|^2_{\mathbb{H}^{m,\sigma}} = m^d \sum w_k^2$. If $p \in \mathcal{P}_\gamma$ with $\gamma < \infty$,
we get that $h \in C^\gamma([0, 1]^d)$ and $\|h\|_\gamma \le \sigma^{-(d+\gamma)} \|p\|_\gamma \|h\|_{\mathbb{H}^{m,\sigma}}$. Hence, we have
$\mathbb{H}_1^{m,\sigma} \subset C^\gamma_R([0, 1]^d)$ in this case, where $R = \sigma^{-(d+\gamma)} \|p\|_\gamma$. For $\gamma = \infty$ and $h$ as
before, it follows from the assumptions on $p$ that the function $h$ is in fact
well defined on $S_\sigma = \{z \in \mathbb{C}^d : \forall j\ |\Im z_j| < \sigma\}$, is analytic on this set and takes
real values on $\mathbb{R}^d$. By the Cauchy-Schwarz inequality, it follows that

$|h(z)|^2 \le \sigma^{-2d} \Big(\sum_{k \in \{1,\ldots,m\}^d} w_k^2\Big) \Big(\sum_{k \in \{1,\ldots,m\}^d} \Big|p\Big(\frac{z - k/m}{\sigma}\Big)\Big|^2\Big)$.

The last factor on the right-hand side is bounded from above by a multiple
of $m^d$ on the set $S_\sigma$. Hence, we obtain

(3.3) $|h(z)| \le K \sigma^{-d} \|h\|_{\mathbb{H}^{m,\sigma}}$

for every $z \in S_\sigma$, where the constant $K$ only depends on the density $p$. Let
$\mathcal{G}_\sigma$ be the set of all analytic functions on $S_\sigma$, uniformly bounded by $K\sigma^{-d}$ on
that set, with $K$ the same constant as in (3.3). The preceding shows that
for the RKHS unit ball we have $\mathbb{H}_1^{m,\sigma} \subset \mathcal{G}_\sigma$ if $\gamma = \infty$.

We see that in all cases we can embed the RKHS unit ball $\mathbb{H}_1^{m,\sigma}$ in a
function space independent of $m$, for which the metric entropy relative to
the supremum norm on $[0, 1]^d$ is essentially known. We have the following
result.

Lemma 3.2. If $\gamma < \infty$, then



for $\varepsilon \in (0, \varepsilon_0)$ and $\sigma \in (0, \sigma_0)$, with constants $K_1, K_2 > 0$ that do not depend
on $\varepsilon$ or $\sigma$. For $\sigma \ge \sigma_0$, it holds that



for all $\varepsilon \in (0, \varepsilon_0)$, with $K_3 > 0$ a constant independent of $\varepsilon$ and $\sigma$.

Proof. The first statement is well known; see, for instance, Theo- 
rem 2.7.1 of [25]. The second statement is similar to the classical result 
given by Theorem 23 of [13], which gives the entropy for the class of ana- 
lytic functions bounded by a constant on a strip in complex space. However, 
the proof of the present statement requires extra care to identify the role
of $\sigma$, because it should not be considered as an irrelevant constant in our
framework. We omit the details, since the proof of Lemma 4.5 of [26] is very 
similar. □ 

In view of the observations preceding Lemma 3.2, we now have entropy 
bounds for the unit ball of the RKHS in all cases. Using the results from [16] 
and [18], these translate into results on the centered small ball probability of
$W^{m,\sigma}$. The first statement of the following lemma follows from the preceding
lemma in combination with the results of [18]. The second statement is 
derived from Lemma 3.2 by arguing as in the proof of Lemma 4.6 in [26]. 

Lemma 3.3. If $d/2 < \gamma < \infty$,




for all a, e > 0, with Kq a constant independent of e,m and a. 
There exist Eq , ctq > such that 






for all e,o~ > 0, with Kq a constant independent of e and a. 
If $\gamma = \infty$, there exist $\varepsilon_0, \sigma_0, K_4 > 0$, not depending on $\varepsilon$ and $\sigma$, such that

l+d 

for all $\varepsilon \in (0, \varepsilon_0)$ and $\sigma \in (0, \sigma_0)$. For $\sigma \ge \sigma_0$ we have



-logP(||W m ' CT ||oo<e)<^5flog- 

for all $\varepsilon \in (0, \varepsilon_0)$, where $K_5 > 0$ is independent of $\varepsilon$ and $\sigma$.

With condition (2.5) in mind, we now consider the noncentered small ball
probabilities of the process $W^{m,\sigma}$. According to Lemma 5.3 of [28], we have
for $w_0 \in C([0, 1]^d)$ the inequality

(3.4) $-\log P(\|W^{m,\sigma} - w_0\|_\infty < 2\varepsilon) \le \varphi^{m,\sigma}_{w_0}(\varepsilon)$,

with $\varphi^{m,\sigma}_{w_0}$ the so-called concentration function, defined as follows:

(3.5) $\varphi^{m,\sigma}_{w_0}(\varepsilon) = \inf_{h \in \mathbb{H}^{m,\sigma} : \|h - w_0\|_\infty \le \varepsilon} \|h\|^2_{\mathbb{H}^{m,\sigma}} - \log P(\|W^{m,\sigma}\|_\infty < \varepsilon)$.

(Our function $w_0$ is actually defined only on $\mathcal{X}$, but we will extend it to
all of $[0, 1]^d$ in an appropriate way later.) That is to say, the exponent of
the noncentered small ball probability involves the exponent of the centered
small ball probability that we considered above and an approximation term
that quantifies how well $w_0$ can be approximated by elements of the RKHS.

To obtain a suitable approximation, we need an auxiliary result concerning
the approximation of a smooth function $f$ by convolutions. Define
$m_k = \int y^k p(y)\, dy$ for $k \in \mathbb{N}_0^d$. Next, for $n \in \mathbb{N}_0^d$ we recursively define two
collections of numbers $c_n$ and $d_n$ as follows. If $n. = 1$, we put $c_n = 0$ and
$d_n = -m_n/n!$. For $n. \ge 2$, we define

(3.6) $c_n = -\sum_{n = l + k,\ l. \ge 1,\ k. \ge 1} \frac{(-1)^{k.}}{k!}\, m_k d_l, \qquad d_n = \frac{(-1)^{n.} m_n}{n!} + c_n$.

Note that the numbers $c_n$ and $d_n$ are well defined and that they only depend
on the moments of $p$. For a function $f \in C^\alpha(\mathbb{R}^d)$ and $\sigma > 0$, we define the
transform $T_{\alpha,\sigma} f$ as follows:

(3.7) $T_{\alpha,\sigma} f = f - \sum_{j=1}^{\beta} \sum_{k. = j} d_k \sigma^j D^k f$.

Here, $\beta$ is the largest integer strictly smaller than $\alpha$ and for a positive integer
$j$ and a multi-index $k \in \mathbb{N}_0^d$ with $k. = j$, $D^k$ is the $j$th order differential
operator

$D^k = \frac{\partial^j}{\partial x_1^{k_1} \cdots \partial x_d^{k_d}}$.

Let $p_\sigma(x) = \sigma^{-d} p(x/\sigma)$.

Lemma 3.4. For $\alpha, \sigma > 0$ and $f \in C^\alpha(\mathbb{R}^d)$, we have
$\|p_\sigma * (T_{\alpha,\sigma} f) - f\|_\infty \le K_6 \sigma^\alpha$,
where $K_6 > 0$ is a constant independent of $\sigma$.

The lemma is an extension of an idea of [21], where a similar method is 
employed to approximate arbitrary smooth densities by beta mixtures. The 
proof follows the same lines but is somewhat more involved in the present 
higher-dimensional case; see Appendix. 
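
The idea behind Lemma 3.4 can be illustrated numerically in the simplest case: $d = 1$, a standard Gaussian kernel (second moment 1, odd moments 0) and the test function $f = \sin$. This is a sketch under those assumptions, not the general transform: smoothing $f$ directly leaves an error of order $\sigma^2$, while smoothing the corrected function $f - (\sigma^2/2) f''$ cancels the second-order term and leaves an error of order $\sigma^4$. The Gauss-Hermite quadrature and all names are choices made for the example.

```python
import numpy as np

def smooth(g, x, sigma, n_nodes=80):
    """(p_sigma * g)(x) for the standard normal kernel, by Gauss-Hermite quadrature."""
    # hermegauss integrates against exp(-y^2 / 2); dividing by sqrt(2 pi)
    # turns the weights into standard normal probabilities.
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    values = g(x[:, None] - sigma * nodes[None, :])
    return values @ weights / np.sqrt(2.0 * np.pi)

def sup_errors(sigma, x):
    """Sup-norm errors of plain and second-order corrected smoothing of sin."""
    f = np.sin
    # Corrected function f - (sigma^2 / 2) f'' with f'' = -sin.
    corrected = lambda t: np.sin(t) + 0.5 * sigma**2 * np.sin(t)
    plain_error = np.max(np.abs(smooth(f, x, sigma) - f(x)))
    corrected_error = np.max(np.abs(smooth(corrected, x, sigma) - f(x)))
    return plain_error, corrected_error

x = np.linspace(0.0, 2.0 * np.pi, 201)
plain_1, corrected_1 = sup_errors(0.1, x)
plain_2, corrected_2 = sup_errors(0.2, x)
```

Doubling $\sigma$ multiplies the plain error by roughly 4 and the corrected error by roughly 16, which matches the orders $\sigma^2$ and $\sigma^4$ and shows why a probability density kernel need not limit the attainable smoothness.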

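The following lemma deals with the approximation of the function $w_0$ by
elements of the RKHS of the process $W^{m,\sigma}$.

Lemma 3.5. For all $\sigma > 0$, $m \ge 1$ and $w_0 \in C^\alpha(\mathcal{X})$ there exists an
$h \in \mathbb{H}^{m,\sigma}$ such that $\|h\|_{\mathbb{H}^{m,\sigma}} \le K_7 (1 \vee \sigma^\beta)$ and

$\sup_{x \in \mathcal{X}} |h(x) - w_0(x)| \le \frac{K_8 (1 \vee \sigma^{\beta+1})}{\sigma^{1+d} m^{\alpha-\beta}} + K_9 \sigma^\alpha$,

for $K_7, K_8, K_9 > 0$ constants independent of $\sigma$ and $m$ and $\beta$ the largest
integer strictly smaller than $\alpha$.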

Proof. Since $\mathcal{X} = [a, b]^d \subset (0, 1)^d$, we can extend $w_0$ to all of $\mathbb{R}^d$ in such
a way that the resulting function belongs to $C^\alpha(\mathbb{R}^d)$ and has support
strictly inside $(0, 1)^d$. Using the operator $T_{\alpha,\sigma}$ introduced above [see (3.7)],
we define

$h(x) = \frac{1}{m^d} \sum_{k \in \{1,\ldots,m\}^d} (T_{\alpha,\sigma} w_0)(k/m)\, \frac{1}{\sigma^d}\, p\Big(\frac{x - k/m}{\sigma}\Big)$

for $x \in [0, 1]^d$. By Lemma 3.1, it holds that $h \in \mathbb{H}^{m,\sigma}$ and

$\|h\|^2_{\mathbb{H}^{m,\sigma}} \le \frac{1}{m^d} \sum_{k \in \{1,\ldots,m\}^d} \big((T_{\alpha,\sigma} w_0)(k/m)\big)^2 \le \|T_{\alpha,\sigma} w_0\|^2_\infty$.

It follows from the definition of $T_{\alpha,\sigma}$ that this is bounded by a constant times
$(1 \vee \sigma^\beta)^2$.

It remains to prove the bound for the approximation error. By the triangle
inequality,

(3.8) $\|h - w_0\|_\infty \le \|h - p_\sigma * (T_{\alpha,\sigma} w_0)\|_\infty + \|p_\sigma * (T_{\alpha,\sigma} w_0) - w_0\|_\infty$.

The first term on the right is the difference between the convolution $p_\sigma * T_{\alpha,\sigma} w_0$
and the corresponding Riemann sum. Using again the triangle inequality,
we get

$|h(x) - (p_\sigma * T_{\alpha,\sigma} w_0)(x)|$
$\quad \le \sup_{\|y - z\|_\infty \le 1/m} |T_{\alpha,\sigma} w_0(y)\, p_\sigma(x - y) - T_{\alpha,\sigma} w_0(z)\, p_\sigma(x - z)|$
$\quad \le \|T_{\alpha,\sigma} w_0\|_\infty \sup_{\|y - z\|_\infty \le 1/m} |p_\sigma(x - y) - p_\sigma(x - z)|$
$\qquad + \|p_\sigma\|_\infty \sup_{\|y - z\|_\infty \le 1/m} |T_{\alpha,\sigma} w_0(y) - T_{\alpha,\sigma} w_0(z)|$.

Now use the facts that $T_{\alpha,\sigma} w_0$ is bounded by a constant times $1 \vee \sigma^\beta$, $p_\sigma$ is
bounded by $\sigma^{-d}$ times a constant, $p$ is Lipschitz and the definition of $T_{\alpha,\sigma} w_0$
to see that

$\|h - p_\sigma * T_{\alpha,\sigma} w_0\|_\infty \le \frac{C_1 (1 \vee \sigma^\beta)}{\sigma^{1+d} m} + \frac{C_2 (1 \vee \sigma^\beta)}{\sigma^d m^{\alpha-\beta}} \le \frac{C_3 (1 \vee \sigma^{\beta+1})}{\sigma^{1+d} m^{\alpha-\beta}}$,

which covers the first term on the right-hand side of (3.8). Lemma 3.4 implies
that the second term is bounded by a constant times $\sigma^\alpha$. □

By combining the preceding lemma with Lemma 3.3 and (3.4), we obtain 
the following result. 

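Lemma 3.6. Let $w_0 \in C^\alpha(\mathcal{X})$.

If $\gamma < \infty$, there exist constants $\varepsilon_0, \sigma_0, K_1, K_2, K_3, K_4 > 0$, independent of
$\sigma$ and $m$, such that

$-\log P\Big(\sup_{x \in \mathcal{X}} |W^{m,\sigma}(x) - w_0(x)| < 2\varepsilon\Big) \le K_1 + K_2 \Big(\frac{1}{\sigma^{d+\gamma} \varepsilon}\Big)^{2d/(2\gamma-d)}$

provided that

$\frac{K_3}{\sigma^{1+d} m^{\alpha-\beta}} + K_4 \sigma^\alpha < \varepsilon < \varepsilon_0$

and $\sigma \in (0, \sigma_0)$.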

If $\gamma = \infty$, there exist constants $\varepsilon_0, \sigma_0, K_1, K_2, K_3, K_4 > 0$, independent of
$\sigma$ and $m$, such that

-logf>(swp\W m >' T (x)-w (x)\<2e) < + tf 2 ^ (log — [r^j , 
provided that

$\frac{K_3}{\sigma^{1+d} m^{\alpha-\beta}} + K_4 \sigma^\alpha < \varepsilon < \varepsilon_0$

and $\sigma \in (0, \sigma_0)$.

3.2. Proof of Theorem 2.2.

3.2.1. Condition (2.5). By definition of the process $W$ and conditioning,

$P\Big(\sup_{x \in \mathcal{X}} |W(x) - w_0(x)| \le \varepsilon\Big) = \sum_{m=1}^{\infty} \lambda_m \int_0^\infty g(\sigma)\, P\Big(\sup_{x \in \mathcal{X}} |W^{m,\sigma}(x) - w_0(x)| \le \varepsilon\Big)\, d\sigma$,

where $\lambda_m = P(M = m)$. If $\gamma < \infty$, Lemma 3.6 implies that there exist constants
$\varepsilon_0, C_1, C_2, C_3, C_4 > 0$, independent of $\sigma$ and $m$, such that if $\varepsilon < \varepsilon_0$
and



iGV /Q < a < de 1/a < 1, m > (^-(i+^/M^)^ 



x 2d/(27-d) 



2 

then 

-logP(sup|W m ' CT (x)-u;o(x)| <e) <C 3 + C 4 , 
Hence, the probability of interest is bounded from below, for e <e^, by 
E A™/ 9 (a)e Xp (-C 1 (^) 

m >C 2 e-( 1 + £i + Q )/( Q ( Q -< 3 )) 7 

> C 5 exp(-C 6 e-( a+d+7 )/ a2d /( 27 - d )) 
for constants $C_5, C_6 > 0$. It follows that condition (2.5) is fulfilled for

(3.9) $\varepsilon_n = M_1 n^{-\alpha/(d_\gamma + 2\alpha(1+\delta_\gamma))}$

for $M_1 > 0$ an appropriate constant and $d_\gamma = 2d(d + \gamma)/(2\gamma - d)$, $\delta_\gamma =
d/(2\gamma - d)$.

If $\gamma = \infty$, the same reasoning implies that there exist constants $C_5, C_6 > 0$
such that, for $\varepsilon > 0$ small enough,

$P\Big(\sup_{x \in \mathcal{X}} |W(x) - w_0(x)| \le \varepsilon\Big) \ge C_5 e^{-C_6 \varepsilon^{-d/\alpha} (\log(1/\varepsilon))^{r \vee (1+d)}}$.

It follows that, in this case, condition (2.5) is fulfilled for

(3.10) $\varepsilon_n = M_1 n^{-\alpha/(d+2\alpha)} \log^t n$

for $M_1 > 0$ an appropriate constant, provided that $t \ge (r \vee (1 + d))/(2 + d/\alpha)$.

3.2.2. Construction of the sets $B_n$ and condition (2.4). First, suppose
that $\gamma < \infty$ again. For $L, R, \varepsilon > 0$, we define

$$B = L\,C^{\gamma}_{R^{-(d+\gamma)}\|p\|_\gamma}([0,1]^d) + \varepsilon B_1,$$




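where $B_1$ is the unit ball of the space $C([0,1]^d)$. The sieves $B_n$ will be defined
by making appropriate choices for $L$, $R$ and $\varepsilon$ below. Recall that in this
case $\mathbb{H}_1^{m,\sigma} \subset C^{\gamma}_{\sigma^{-(d+\gamma)}\|p\|_\gamma}([0,1]^d)$. Hence, by the Borell-Sudakov inequality
(see, e.g., [19]), with $\Phi$ the standard normal distribution function and for
$\sigma > R$,

$$P(W^{m,\sigma} \notin B) \le P\bigl(W^{m,\sigma} \notin L\,\mathbb{H}_1^{m,\sigma} + \varepsilon B_1\bigr) \le 1 - \Phi\bigl(\Phi^{-1}\bigl(P(\|W^{m,\sigma}\|_\infty < \varepsilon)\bigr) + L\bigr).$$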

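By Lemma 3.3, we have, for $\sigma > R$ and $R < 1$,

$$P(\|W^{m,\sigma}\|_\infty < \varepsilon) \ge e^{-K_6 R^{-d_\gamma}\varepsilon^{-2d/(2\gamma-d)}}$$

for a constant $K_6 > 0$ and $\varepsilon > 0$ small enough. Since $\Phi^{-1}(y) \ge -\sqrt{(5/2)\log(1/y)}$
for $y \in (0, 1/2)$, it follows that

$$P(W^{m,\sigma} \notin B) \le 1 - \Phi\Bigl(L - \sqrt{(5/2)K_6 R^{-d_\gamma}\varepsilon^{-2d/(2\gamma-d)}}\Bigr) \le e^{-\frac12\bigl(L - \sqrt{(5/2)K_6 R^{-d_\gamma}\varepsilon^{-2d/(2\gamma-d)}}\bigr)^2}$$

for $\sigma > R$ and $L > \sqrt{(5/2)K_6 R^{-d_\gamma}\varepsilon^{-2d/(2\gamma-d)}}$. By the definition of $W$ and
conditioning,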

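$$P(W \notin B) \le \sum_{m=1}^{\infty}\lambda_m\int_R^\infty g(\sigma)\,P(W^{m,\sigma} \notin B)\,d\sigma + P(\Sigma < R).$$

By the preceding, the first term on the right is bounded by

$$e^{-\frac12\bigl(L - \sqrt{(5/2)K_6 R^{-d_\gamma}\varepsilon^{-2d/(2\gamma-d)}}\bigr)^2}.$$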

The assumption on $g$ and a substitution show that the second term is
bounded by

$$\int_{1/R}^{\infty} D_3\,x^{q-2}\,e^{-D_4 x^{d_\gamma}\log^r x}\,dx.$$

By Lemma 4.9 of [26], this is further bounded by

$$\frac{2D_3(1/R)^{q-2-d_\gamma+1}}{d_\gamma D_4(\log(1/R))^r}\,e^{-D_4(1/R)^{d_\gamma}(\log(1/R))^r} \le e^{-\frac12 D_4(1/R)^{d_\gamma}(\log(1/R))^r}$$

for $R$ small enough.
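
The two Gaussian facts used in the Borell-Sudakov step above, the quantile bound $\Phi^{-1}(y) \ge -\sqrt{(5/2)\log(1/y)}$ on $(0, 1/2)$ and the tail bound $1 - \Phi(x) \le e^{-x^2/2}$ for $x > 0$, can be sanity-checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the test grids are illustrative choices and not part of the paper, and a check on a finite grid is of course not a proof.

```python
import math
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
STD_NORMAL = NormalDist()


def quantile_lower_bound(y: float) -> float:
    """Right-hand side -sqrt((5/2) * log(1/y)) of the quantile bound."""
    return -math.sqrt(2.5 * math.log(1.0 / y))


def quantile_bound_holds(y: float) -> bool:
    """Check Phi^{-1}(y) >= -sqrt((5/2) log(1/y)) at a single point y in (0, 1/2)."""
    return STD_NORMAL.inv_cdf(y) >= quantile_lower_bound(y)


def tail_bound_holds(x: float) -> bool:
    """Check 1 - Phi(x) <= exp(-x^2 / 2) at a single point x > 0."""
    return 1.0 - STD_NORMAL.cdf(x) <= math.exp(-0.5 * x * x)


# Illustrative grids: small probabilities down to 1e-12, plus points near 1/2.
ys = [10.0 ** (-k) for k in range(1, 13)] + [0.2, 0.3, 0.4, 0.499]
xs = [0.01, 0.1, 0.5, 1.0, 2.0, 4.0, 8.0]

print(all(quantile_bound_holds(y) for y in ys))
print(all(tail_bound_holds(x) for x in xs))
```

Combining the two bounds with $y = P(\|W^{m,\sigma}\|_\infty < \varepsilon)$ gives the exponential bound on $P(W^{m,\sigma} \notin B)$ used in the text.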

Given $C > 1$, we now define the sieve $B_n$ by

$$B_n = L_n\,C^{\gamma}_{R_n^{-(d+\gamma)}\|p\|_\gamma}([0,1]^d) + \varepsilon_n B_1,$$

where $\varepsilon_n$ is given by (3.9). To show that (2.4) holds, we have to show we
can choose $R_n$ and $L_n$ such that

$$\frac{D_4}{2R_n^{d_\gamma}}\log^r\frac{1}{R_n} \ge Cn\varepsilon_n^2$$

and

$$\Bigl(L_n - \sqrt{(5/2)K_6 R_n^{-d_\gamma}\varepsilon_n^{-2d/(2\gamma-d)}}\Bigr)^2 \ge Cn\varepsilon_n^2.$$
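Observe that if we take

$$\frac{D_4}{R_n^{d_\gamma}} = M n^{(d_\gamma+2\alpha\delta_\gamma)/(d_\gamma+2\alpha(1+\delta_\gamma))}$$

for a large enough constant $M$, the first condition is satisfied. The second
condition is then fulfilled if we choose

$$L_n^2 = N n^{(d_\gamma+4\alpha\delta_\gamma)/(d_\gamma+2\alpha(1+\delta_\gamma))}$$

for $N$ large enough.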

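Next, we consider the case $\gamma = \infty$. Recall that $\mathcal{F}_\sigma$ is the set of all analytic
functions defined on the strip $S_\sigma = \{z \in \mathbb{C}^d : |\operatorname{Im} z_j| < \sigma \text{ for all } j\}$ that are bounded
by $K\sigma^{-d}$ on $S_\sigma$. Arguing as before and now using that $\mathbb{H}_1^{m,\sigma} \subset \mathcal{F}_\sigma$ and
$\mathcal{F}_{\sigma_1} \subset \mathcal{F}_{\sigma_2}$ if $\sigma_1 > \sigma_2$, we get, for $L, R, \varepsilon > 0$ and $B = L\mathcal{F}_R + \varepsilon B_1$,

$$P(W^{m,\sigma} \notin B) \le e^{-\frac12\bigl(L - \sqrt{(5/2)K_6 R^{-d}(\log(1/(\varepsilon R^{1+d})))^{1+d}}\bigr)^2}$$

for $\sigma > R$ and $L > \sqrt{(5/2)K_6 R^{-d}(\log(1/(\varepsilon R^{1+d})))^{1+d}}$. By the same conditioning
argument as before, it follows that if, given $C > 1$, we define $B_n$ in
this case by

$$B_n = L_n\mathcal{F}_{R_n} + \varepsilon_n B_1,$$

where $\varepsilon_n$ is given by (3.10), then condition (2.4) is fulfilled if we choose $R_n$
and $L_n$ such that

$$\frac{D_4}{2R_n^{d}}\log^r\frac{1}{R_n} \ge Cn\varepsilon_n^2$$

and

$$\Bigl(L_n - \sqrt{(5/2)K_6 R_n^{-d}\bigl(\log(1/(\varepsilon_n R_n^{1+d}))\bigr)^{1+d}}\Bigr)^2 \ge Cn\varepsilon_n^2.$$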
Observe that we can take

$$\frac{D_4}{R_n^d} = M n^{d/(d+2\alpha)}\log^v n$$

for a large enough constant $M$ and $v > 2t - r$ [with $t$ as in (3.10)], and $L_n$
a large enough power of $n$.
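3.2.3. Entropy condition. Suppose $\gamma < \infty$. For the entropy of the sieve
$B_n$, we have in this case, for $\bar\varepsilon_n > \varepsilon_n$,

$$N(2\bar\varepsilon_n, B_n, \|\cdot\|_\infty) \le N\bigl(\bar\varepsilon_n, L_n C^{\gamma}_{R_n^{-(d+\gamma)}\|p\|_\gamma}([0,1]^d), \|\cdot\|_\infty\bigr) \le N\bigl(\bar\varepsilon_n R_n^{d+\gamma}/(L_n\|p\|_\gamma), C^{\gamma}_1([0,1]^d), \|\cdot\|_\infty\bigr).$$

Hence (see Lemma 3.2),

$$\log N(2\bar\varepsilon_n, B_n, \|\cdot\|_\infty) \le K_7\Bigl(\frac{L_n}{\bar\varepsilon_n R_n^{d+\gamma}}\Bigr)^{d/\gamma}.$$

This is bounded by a constant times $n\bar\varepsilon_n^2$ for

$$\bar\varepsilon_n \ge \frac{L_n^{d/(d+2\gamma)}}{n^{\gamma/(d+2\gamma)} R_n^{d(d+\gamma)/(d+2\gamma)}}.$$
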

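For $L_n$ and $R_n$ chosen as above, this yields

$$\bar\varepsilon_n \ge n^{-\frac{\alpha(1-(d\delta_\gamma)/(2\gamma))}{d_\gamma+2\alpha(1+\delta_\gamma)+d(d_\gamma+2\alpha(1+\delta_\gamma))/(2\gamma)}}.$$

Note that $\bar\varepsilon_n$ is always larger than $\varepsilon_n$, as was required.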
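
The exponent bookkeeping in this step can be verified with exact rational arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes that $R_n^{-d_\gamma}$ grows like $n^{(d_\gamma+2\alpha\delta_\gamma)/(d_\gamma+2\alpha(1+\delta_\gamma))}$ and $L_n^2$ like $n^{(d_\gamma+4\alpha\delta_\gamma)/(d_\gamma+2\alpha(1+\delta_\gamma))}$, which are the choices consistent with the two sieve conditions. Under these assumptions it compares the power of $n$ in the threshold $L_n^{d/(d+2\gamma)} n^{-\gamma/(d+2\gamma)} R_n^{-d(d+\gamma)/(d+2\gamma)}$ with the closed form displayed above.

```python
from fractions import Fraction


def _constants(alpha, gamma, d):
    """Return (alpha, gamma, d, d_gamma, delta_gamma, denominator of (3.9)) as Fractions."""
    alpha, gamma, d = Fraction(alpha), Fraction(gamma), Fraction(d)
    if 2 * gamma <= d:
        raise ValueError("need 2*gamma > d for d_gamma and delta_gamma to be positive")
    d_g = 2 * d * (d + gamma) / (2 * gamma - d)   # d_gamma
    delta_g = d / (2 * gamma - d)                 # delta_gamma
    big_d = d_g + 2 * alpha * (1 + delta_g)       # d_gamma + 2 alpha (1 + delta_gamma)
    return alpha, gamma, d, d_g, delta_g, big_d


def sieve_exponents(alpha, gamma, d):
    """Powers of n in eps_n (3.9) and in the entropy threshold, for gamma < infinity."""
    alpha, gamma, d, d_g, delta_g, big_d = _constants(alpha, gamma, d)
    eps_pow = -alpha / big_d                                   # eps_n ~ n^eps_pow
    # ASSUMPTION: R_n^{-d_gamma} ~ n^{(d_gamma + 2 alpha delta_gamma) / big_d}
    r_pow = -(d_g + 2 * alpha * delta_g) / (big_d * d_g)       # R_n ~ n^r_pow
    # ASSUMPTION: L_n^2 ~ n^{(d_gamma + 4 alpha delta_gamma) / big_d}
    l_pow = (d_g + 4 * alpha * delta_g) / (2 * big_d)          # L_n ~ n^l_pow
    # Threshold: L_n^{d/(d+2g)} * n^{-g/(d+2g)} * R_n^{-d(d+g)/(d+2g)}
    bar_pow = (d * l_pow - gamma - d * (d + gamma) * r_pow) / (d + 2 * gamma)
    return eps_pow, bar_pow


def closed_form_bar_exponent(alpha, gamma, d):
    """Power of n in the closed-form lower bound for the entropy rate."""
    alpha, gamma, d, d_g, delta_g, big_d = _constants(alpha, gamma, d)
    numerator = alpha * (1 - d * delta_g / (2 * gamma))
    return -numerator / (big_d + d * big_d / (2 * gamma))


eps_pow, bar_pow = sieve_exponents(alpha=1, gamma=2, d=1)
print(eps_pow, bar_pow, closed_form_bar_exponent(1, 2, 1))
```

In every case tried, the entropy exponent is strictly larger (less negative) than the exponent of $\varepsilon_n$, in line with the remark that $\bar\varepsilon_n$ dominates $\varepsilon_n$.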

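Let now $\gamma = \infty$. Arguing as before, we have in this case, for $\bar\varepsilon_n > \varepsilon_n$,

$$\log N(2\bar\varepsilon_n, B_n, \|\cdot\|_\infty) \le \log N(\bar\varepsilon_n/L_n, \mathcal{F}_{R_n}, \|\cdot\|_\infty) \le K_7\,\frac{1}{R_n^d}\Bigl(\log\frac{L_n}{\bar\varepsilon_n}\Bigr)^{1+d}$$

by Lemma 3.2. With the choices of $R_n$ and $L_n$ made in this case above and
for $\bar\varepsilon_n$ bounded from below by a power of $n$, this is bounded by a constant
times $n^{d/(d+2\alpha)}\log^{1+d+v} n$. This is further bounded by a constant times $n\bar\varepsilon_n^2$
for

$$\bar\varepsilon_n = n^{-\alpha/(d+2\alpha)}\log^{a} n,$$

provided $a \ge (1+d+v)/2$. The requirement that $\bar\varepsilon_n > \varepsilon_n$ translates into the
condition $a > t$.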

APPENDIX 

Proof of Lemma 3.4. The proof is by induction on $\beta$, which is the
largest integer strictly smaller than $\alpha$. If $\beta = 0$ then $\alpha \in (0,1]$ and $T_{\alpha,\sigma}f = f$
and the statement of the claim is standard. To prove the induction step,
suppose now that $\beta \ge 1$. By definition of $T_{\alpha,\sigma}f$, we have

$$(p_\sigma * T_{\alpha,\sigma}f - f)(x) = \int p_\sigma(y)\Bigl(f(x-y) - f(x) - \sum_{j=1}^{\beta}\sum_{k.=j} d_k\sigma^j (D^k f)(x-y)\Bigr)\,dy.$$

By Taylor's formula and the fact that $f \in C^\alpha$,

$$f(x-y) - f(x) = \sum_{j=1}^{\beta}\sum_{k.=j}\frac{(-y)^k}{k!}(D^k f)(x) + R(x,y),$$

where $|R(x,y)| \le C\|y\|^\alpha$. It follows that

$$(p_\sigma * T_{\alpha,\sigma}f - f)(x) = \int p_\sigma(y)R(x,y)\,dy + \sum_{j=1}^{\beta}\sum_{k.=j}\Bigl[\frac{(-1)^j (D^k f)(x)}{k!}\,\sigma^j m_k - d_k\sigma^j\,p_\sigma * (D^k f)(x)\Bigr].$$

The first term on the right is easily seen to be bounded by a constant times
$\sigma^\alpha$. To see that this holds for the second term as well, we use the induction
hypothesis.
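
The bound on the first term follows from a change of variables. A brief sketch, assuming (as is standard, though not restated in this proof) that $p_\sigma(y) = \sigma^{-d}p(y/\sigma)$ and that $p$ has a finite absolute moment of order $\alpha$:

```latex
\Bigl|\int p_\sigma(y)R(x,y)\,dy\Bigr|
  \le C\int |p_\sigma(y)|\,\|y\|^{\alpha}\,dy
  = C\sigma^{\alpha}\int |p(u)|\,\|u\|^{\alpha}\,du,
  \qquad u = y/\sigma .
```

The right-hand side does not depend on $x$, so the bound is uniform.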

By definition of the constants $c_k$ and $d_k$ [see (3.6)], the second term can
be written as

$$\sum_{j=1}^{\beta}\sum_{k.=j}\Bigl(\frac{(-1)^j m_k\sigma^j}{k!}\bigl(D^k f - p_\sigma * (D^k f)\bigr)(x) - c_k\sigma^j\,p_\sigma * (D^k f)(x)\Bigr).$$

Now for $j \le \beta$ and $k. = j$, consider the decomposition

$$D^k f - p_\sigma * (D^k f) = \bigl(D^k f - p_\sigma * (T_{\alpha-j,\sigma}D^k f)\bigr) + \bigl(p_\sigma * (T_{\alpha-j,\sigma}D^k f) - p_\sigma * (D^k f)\bigr).$$

Since $D^k f \in C^{\alpha-j}$, the induction hypothesis implies that the first term on
the right is uniformly bounded by a constant times $\sigma^{\alpha-j}$. Combined with
the first display of the paragraph, this shows that it suffices to show that

$$\sum_{j=1}^{\beta}\sum_{k.=j}\Bigl(\frac{(-1)^j m_k\sigma^j}{k!}\bigl(T_{\alpha-j,\sigma}D^k f - D^k f\bigr) - c_k\sigma^j D^k f\Bigr) = 0$$


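identically. Straightforward algebra shows that

$$T_{\alpha-j,\sigma}D^k f - D^k f = -\sum_{i=1}^{\beta-j}\sum_{l.=i} d_l\sigma^i D^{l+k}f.$$

Hence,

$$\sum_{j=1}^{\beta}\sum_{k.=j}\frac{(-1)^j m_k\sigma^j}{k!}\bigl(T_{\alpha-j,\sigma}D^k f - D^k f\bigr) = -\sum_{j=1}^{\beta}\sum_{k.=j}\sum_{i=1}^{\beta-j}\sum_{l.=i}\frac{(-1)^j m_k d_l}{k!}\,\sigma^{j+i}D^{k+l}f = -\sum_{s=2}^{\beta}\sum_{n.=s}\Bigl(\sum_{\substack{n=l+k \\ l.\ge 1,\ k.\ge 1}}\frac{(-1)^{k.}}{k!}\,m_k d_l\Bigr)\sigma^s D^n f.$$

By definition of the numbers $c_n$ and $d_n$ this equals

$$\sum_{s=1}^{\beta}\sum_{n.=s} c_n\sigma^s D^n f,$$

and the proof is complete. □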